Abstract: We propose a new method for calculating statistical properties,
such as the entropy, of unknown generators of symbolic sequences. The
probability distribution p(k) of the elements k of a population can be
approximated by the frequencies f(k) of a sample, provided the sample is long
enough that each element k occurs many times. Our method yields an
approximation if this precondition does not hold. For a given f(k) we
recalculate the Zipf-ordered probability distribution by optimizing the
parameters of a guessed distribution. We demonstrate that our method yields
reliable results.
1 Introduction



Given a statistical population of discrete events k generated by a stationary
dynamic process, one of the most interesting statistical properties of the
population, and hence of the process, is its entropy. If the sample space,
i.e. the number of different elements which are allowed to occur in the
population, is small compared with the size of a drawn sample, one can
approximate the probabilities p(k) of the elements k by their relative
frequencies f(k), and one finds for the observed entropy H_obs

$$ H = -\sum_k p(k) \log p(k) \approx -\sum_k f(k) \log f(k) = H_{obs} . \qquad (1) $$

If the number of allowed different events is not small compared with the
size of the sample, the approximation p(k) ≈ f(k) yields dramatically wrong
results. In this case the knowledge of the frequencies is not sufficient to
determine the entropy. The aim of this paper is to provide a method to
calculate the entropy and other statistical characteristics for the case that
the approximation (1) does not hold.

An interesting example of such systems are subsequences (words) of length n
of symbolic sequences of length L written using an alphabet of A letters.
Examples are biosequences like DNA (A = 4, L ≈ 10^9), literary texts (A ≈ 80
letters and punctuation marks, L ≈ 10^7) and computer files (A = 2, L
arbitrary). For biosequences there are A^n = 1,048,576 different words of
length n = 10. To measure the probability distribution of the words directly
by counting their frequencies we need a sequence of length at least 10^8 to
have reliable statistics. Therefore the ensemble of subsequences of length n
is a typical example where the precondition does not hold. To illustrate the
problem we calculate the observed entropy H_obs^(n) for N words of length n in
a Bernoulli sequence with A = 2 where both symbols occur with the same
probability. The exact result is H^(n) = n log A. In figure 1 we have drawn
the values of H^(n) and H_obs^(n) over n. The observed entropy values are
correct for small word length n, where we can approximate the probabilities by
the relative frequencies. For larger word length, however, the observed
entropies are significantly below the exact values, even for very large
samples (circles: N = 10^6, diamonds: N = 10^4).
Fig. 1. The observed entropy for N words of length n from a Bernoulli sequence.
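
The effect shown in Fig. 1 can be reproduced with a few lines of Python. The
following is a minimal sketch, not the procedure used for the figure: it draws
N independent fair-coin words rather than cutting them from one long sequence,
uses natural logarithms, and the function names are chosen here for
illustration only.

```python
import math
import random
from collections import Counter

def observed_entropy(words):
    """Plug-in entropy (natural log) from relative frequencies, as in eq. (1)."""
    counts = Counter(words)
    total = len(words)
    return -sum(c / total * math.log(c / total) for c in counts.values())

def bernoulli_words(num_words, n, rng):
    """Draw num_words independent words of n fair bits, encoded as integers."""
    return [rng.getrandbits(n) for _ in range(num_words)]

rng = random.Random(0)
N = 10_000
for n in (2, 8, 14, 20):
    h_obs = observed_entropy(bernoulli_words(N, n, rng))
    h_exact = n * math.log(2)          # exact result n log A with A = 2
    print(f"n={n:2d}  H_obs={h_obs:.3f}  H_exact={h_exact:.3f}")
```

Because at most N different words can occur in a sample of N words, the
observed entropy can never exceed log N, which is why it saturates once
n log A grows beyond that value.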
Under several strong preconditions the probabilities of words in sequences can
be estimated from the frequencies using various correction methods Jl]-Q . The
advanced algorithm proposed in || is based on a theorem by McMillan and
Khinchin Q saying that for word length n → ∞ the frequencies of the admitted
substrings of a sequence are equally distributed. If one is interested in the
entropies for finite words, however, the theoretical basis for applying this
theorem is weak and there is no evidence about the reliability of the results.
Moreover, this theorem is proven for Markov sequences only. In sequences
gathered from natural languages, biosequences and other natural or artificial
sources it is very unlikely that the probabilities of the words of interesting
length, e.g. words or sentences for languages, amino acids or elements of the
hidden "DNA language" for biosequences, are equally distributed. Otherwise we
would have to assume that all English five-letter words are equally frequent.
Certainly this is not the case.



2 Description of the method 

To calculate the entropy of a distribution it is not necessary to determine
the probability p(k) for each event k. It is sufficient to determine the
values of the probabilities without knowing which probability belongs to which
event. Generally speaking, if we assume K events there are K! different
relations k ↔ p. We need not determine one particular relation (the correct
one) but only an arbitrary one of them. Hence the calculation of the entropy
is K! times easier than determining the probability p(k) for each event k. We
assume a special order where the first element has the largest probability,
the second one the second largest, etc. We call this distribution
Zipf-ordered. Zipf ordering means that the probabilities of the elements are
ordered according to their rank, and therefore the distribution p(k) is a
monotonically decaying function. The following procedure describes how to
reconstruct the Zipf-ordered probability distribution p(k) from a finite
sample.
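
The statement that the entropy depends only on the set of probability values,
not on their assignment to events, can be checked directly. The sketch below
uses a hypothetical four-event distribution chosen for illustration; the
function name `entropy` is likewise an assumption of this example.

```python
import math
import random

def entropy(probs):
    """Shannon entropy with natural logarithm; zero probabilities are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)

probs = [0.05, 0.5, 0.15, 0.3]            # hypothetical p(k), in event order
shuffled = probs[:]
random.Random(0).shuffle(shuffled)        # some other assignment k <-> p
zipf_ordered = sorted(probs, reverse=True)  # rank 1 = largest probability

print(entropy(probs), entropy(shuffled), entropy(zipf_ordered))
print(math.factorial(len(probs)))         # number K! of possible assignments
```

All three entropies agree up to floating-point rounding, so it is enough to
recover the Zipf-ordered values.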

Suppose we have some reason to expect (to guess) the parametric form of
the probability distribution. As an example we use a simple distribution
p(k, α, β, γ) with k = 1, 2, ... consisting of a linearly decreasing and a
constant part

$$ p(k) = \begin{cases} \ldots & : 1 \le k < \beta \\ (2 - \alpha\beta)/(2\gamma) & : \beta \le k < \gamma \\ 0 & : k \ge \gamma . \end{cases} \qquad (2) $$

Then the algorithm runs as follows:

i. Find the frequencies F(k) for the N events k and order them according
to their value (Zipf-order). The index k runs over all different events
occurring in the sample (k ∈ {1 ... K_MAX}). Note: there are N events but
only K_MAX different ones. Normalize this distribution: F_1(k) = F(k)/N.
There are various sophisticated algorithms to find the frequencies of large
samples and to order them (e.g. 0). As in earlier papers, we used a
hashing method to find the elements and, for sorting, a mixed algorithm
consisting of Quicksort for the frequent elements and Distribution
Counting for the long tail of elements with low frequencies.

ii. Guess initial values for the parameters (in our case α, β and γ).

iii. Generate M samples of N random integers (RI_k^m, k = 1 ... N, m =
1 ... M) according to the parametric probability distribution p(k, α, β, γ).
In the following examples we used M = 20. Order each of the samples
according to the ranks, which gives f_i(k, α, β, γ) (i = 1 ... M). Average
over the M ordered samples

$$ f(k,\alpha,\beta,\gamma) = \frac{1}{M} \sum_{i=1}^{M} f_i(k,\alpha,\beta,\gamma) \qquad (3) $$

with k ∈ {1, ..., k_max} and k_max = max(k_max^i, i = 1 ... M). Since we want
to determine the averaged or typical Zipf-ordered distribution, it is
important to order the elements first and then to average. Normalize the
averaged distribution of the frequencies

$$ f_1(k,\alpha,\beta,\gamma) = f(k,\alpha,\beta,\gamma) \Big/ \sum_{k'=1}^{k_{max}} f(k',\alpha,\beta,\gamma) . \qquad (4) $$

iv. Measure the deviation D between the normalized averaged simulated
frequency distribution f_1(k, α, β, γ) and the frequency distribution F_1(k)
of the given sample according to a certain rule, e.g.

$$ D = \sum_{k=1}^{K} \left( \frac{f_1(k,\alpha,\beta,\gamma)}{F_1(k)} - 1 \right)^2 , \quad K = \max\{k_{max}, K_{MAX}\} . \qquad (5) $$

v. Change the parameters of the guessed probability distribution p(k) (in
our case the parameters α, β and γ) according to an optimization rule
(e.g. 0) which minimizes D, and return to step iii until the deviation D
is sufficiently small.

vi. Extract the statistical properties of interest from the probability
distribution p(k) using the parameters α*, β* and γ* which have been
obtained during the optimization process.
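
Steps i to vi can be sketched in Python as follows. This is an illustration
under several assumptions, not the implementation used in the paper: the
linear branch of the ansatz is taken as c + α(1 - k/β), which is one form
compatible with the constant (2 - αβ)/(2γ) and with normalization; the
deviation is a squared relative deviation summed over the observed ranks only
(to avoid dividing by zero); the optimizer is a plain random-perturbation hill
climbing standing in for the optimization rule cited in step v; and all
parameter values are toy values, far smaller than those used in the examples.

```python
import math
import random
from collections import Counter

def ansatz(alpha, beta, gamma):
    """Linear-plus-constant guess on k = 1..gamma (linear branch is assumed)."""
    c = (2.0 - alpha * beta) / (2.0 * gamma)
    probs = [c + (alpha * (1.0 - k / beta) if k < beta else 0.0)
             for k in range(1, int(gamma) + 1)]
    total = sum(probs)
    return [p / total for p in probs]    # renormalize the discrete sum

def zipf_frequencies(sample):
    """Step i: count events, sort the counts in decreasing order, normalize."""
    counts = sorted(Counter(sample).values(), reverse=True)
    return [c / len(sample) for c in counts]

def simulated_frequencies(probs, n, m, seed):
    """Step iii: order each of m samples first, then average rank by rank."""
    rng = random.Random(seed)            # same seed for every evaluation
    ranked = [zipf_frequencies(rng.choices(range(len(probs)), weights=probs, k=n))
              for _ in range(m)]
    k_max = max(len(r) for r in ranked)
    avg = [sum(r[k] for r in ranked if k < len(r)) / m for k in range(k_max)]
    total = sum(avg)
    return [a / total for a in avg]

def deviation(f1, F1):
    """Step iv: squared relative deviation over the observed ranks (assumed)."""
    return sum(((f1[k] if k < len(f1) else 0.0) / F1[k] - 1.0) ** 2
               for k in range(len(F1)))

def valid(params):
    alpha, beta, gamma = params
    return alpha > 0 and 1 < beta < gamma and alpha * beta < 2

def guess(F1, n, start, m=5, steps=40, seed=0):
    """Steps ii and v: keep a random perturbation only if it lowers D."""
    rng = random.Random(seed)

    def cost(p):
        return deviation(simulated_frequencies(ansatz(*p), n, m, seed), F1)

    best, best_d = start, cost(start)
    for _ in range(steps):
        trial = tuple(x * math.exp(rng.gauss(0.0, 0.15)) for x in best)
        if valid(trial):
            d = cost(trial)
            if d < best_d:
                best, best_d = trial, d
    return best, best_d

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy run with hypothetical parameters.
true_params = (4e-3, 100.0, 400.0)
n = 1000
sample = random.Random(42).choices(range(400), weights=ansatz(*true_params), k=n)
F1 = zipf_frequencies(sample)
start = (2e-3, 50.0, 250.0)
d_start = deviation(simulated_frequencies(ansatz(*start), n, 5, 0), F1)
params, d_final = guess(F1, n, start)
h_obs = entropy(F1)
h_guess = entropy(ansatz(*params))       # step vi
print(params, d_start, d_final, h_obs, h_guess, entropy(ansatz(*true_params)))
```

Reusing the same random seed for every evaluation of D makes the deviation a
deterministic function of the parameters, so two parameter sets can be
compared without sampling noise deciding the outcome. This is a design choice
of the sketch, not something stated in the text.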



3 Examples 
3.1 Entropy of artificial sequences



We generated a statistical ensemble of size N = 10^4 according to the
probability distribution eq. (2) with α = 9.0 · 10^-6, β = 10,000 and
γ = 50,000. Fig. 2 (solid lines) shows the probability distribution p(k) and
the Zipf-ordered frequencies f(k).

Fig. 2. The probability distribution p(k) (eq. (2)) and the Zipf-ordered
frequencies f(k) corresponding to this distribution. The dashed lines, which
can hardly be distinguished from the solid lines, display the guessed
distributions. The initial distribution before optimization is drawn with
wide dashes.

Optimizing the parametric guessed probability distribution using the proposed
method we find the optimized parameters α* = 9.22 · 10^-6, β* = 12,900 and
γ* = 50,000, i.e. the guessed and the actual distributions nearly coincide.
Since we know the original probability distribution (eq. (2)) we can compare
its exact entropy with the entropy of the guessed probability distribution
H_guess and with the observed entropy H_obs according to eqs. (6, 7).



$$ H_{guess} = -\sum_k p(k,\alpha^*,\beta^*,\gamma^*) \log p(k,\alpha^*,\beta^*,\gamma^*) \qquad (6) $$

$$ H_{obs} = -\sum_k F_1(k) \log F_1(k) \qquad (7) $$



We found H_obs = 9.0811 and H_guess = 10.8147; the exact value according to
p(k, α, β, γ) (eq. (2)) is H = 10.8188.
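
As a plausibility check on these numbers (added here, not taken from the
original text), two elementary bounds are useful. They assume natural
logarithms, which the text does not state explicitly but which the reported
values fit, and they assume that p(k) vanishes beyond γ. A sample of N events
contains at most N different elements, and a distribution supported on at
most γ elements cannot be more uncertain than the uniform one:

```latex
H_{obs} \le \ln N = \ln 10^{4} \approx 9.2103 ,
\qquad
H \le \ln \gamma = \ln\left(5 \cdot 10^{4}\right) \approx 10.8198 .
```

The reported H_obs = 9.0811 lies below the first bound, so with N = 10^4 the
observed entropy cannot reach the exact value of about 10.82, no matter how
the sample is drawn.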



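Now we try to guess a probability distribution of a more complicated form

$$ p(k) = \begin{cases} \alpha\,(k \ldots \varepsilon)^{\ldots} & : 1 \le k < \beta \\ \varphi\, k^{-\delta} & : \beta \le k < \gamma \\ 0 & : k \ge \gamma . \end{cases} \qquad (8) $$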
(As we will show below, this function approximates the probability
distribution of the words in an English text.) The variables α and φ can be
eliminated using the normalization and continuity conditions. The test sample
of size N = 10^4 was generated using ε = 0.9, β = 22, δ = 0.64 and
γ = 70,000. The optimization yields the parameters ε* = 0.79, β* = 21.9,
δ* = 0.63 and γ* = 65,000. Fig. 3 shows the original and the guessed
probability distributions and the Zipf-ordered frequencies for both cases.
The guessed entropy H_guess = 10.5053 approximates the exact value
H = 10.5397 very well, while the observed entropy H_obs = 8.8554 clearly
deviates from the correct value.
Fig. 3. The original and guessed probability distributions and the Zipf-ordered
frequencies for the distribution in eq. (8) (N = 10^4).




3.2 Words in an English text 



With the ansatz (8) we tried to guess the probability distribution of the
words of different length n in the text "Moby Dick" by H. Melville g. The text
was mapped to an alphabet of A = 32 letters as described in |§. Depending
on overlapping or non-overlapping counting of the words we expect different
results. We note that overlapping counting is statistically not correct, since
the elements of the sample are not statistically independent; however, only
overlapping counting yields enough words to get reasonable results for the
observed entropy. We will show that our method works in both cases,
overlapping and non-overlapping. Fig. 4 shows the ordered frequencies of
N = 5 · 10^4 words of length n = 6. The optimized distribution eq. (8)
reproduces the original frequency distribution (Moby Dick) with satisfactory
accuracy.

Fig. 4. Zipf-ordered frequencies of words of length n = 6 in "Moby Dick". The
curves guessed (a) and (b) (top) display the frequency distributions which
have been reproduced using the guessed probability distributions in the bottom
figure according to eq. (8) (a) and eq. (9) (b).

Using the ansatz (8) we found ε* = 0.73, β* = 31, δ* = 0.70 and
γ* = 129,890. This calculation was carried out for various word lengths n.
Fig. 5 shows the entropies H_obs^(n) and H_guess^(n) according to eqs. (6, 7)
as a function of n. All results obtained have been derived from a set of
5 · 10^4 non-overlapping words taken from the text of length L = 10^6. When
we count overlapping words, we find, surprisingly, that the entropy is quite
insensitive to the counting method (see the curves with filled and empty
diamonds in fig. 5). The rather difficult problem of overlapping or
non-overlapping counting will be addressed in detail in [10]. Since the exact
probability distribution for the words in "Moby Dick" is unknown, we compare
the guessed entropy (crosses) with the observed entropy (empty diamonds:
overlapping counting, full diamonds: non-overlapping counting) and with an
estimate of the entropy obtained by an extrapolation method (see [6]), all
based on the same set of data (N = 5 · 10^4), and with the observed entropy
based on a twenty times larger set of data (boxes: overlapping counting). As
expected, for longer word length n the observed values H_obs^(n) underestimate
the entropy. For small n they are reliable because the statistics are
reliable. The guessed entropy H_guess agrees for small n with the observed
entropy and for large n with the extrapolated values.

Fig. 5. Observed entropy H_obs^(n) and guessed entropy H_guess^(n) for the
text "Moby Dick" (N = 5 · 10^4) over the word length n. The circles display
the results of an extrapolation method described in [6] and the boxes □ show
the observed entropy of the text using N = 10^6 words. (o) denotes overlapping
and (n) non-overlapping counting.
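
The difference between the two counting schemes is easy to state in code. The
following minimal sketch uses a short toy string as a stand-in for the mapped
text; the function name `count_words` and its parameters are hypothetical.

```python
from collections import Counter

def count_words(text, n, overlapping):
    """Count words of length n; the window moves by 1 (overlapping) or by n."""
    step = 1 if overlapping else n
    return Counter(text[i:i + n] for i in range(0, len(text) - n + 1, step))

text = "call me ishmael some years ago"   # toy stand-in for the mapped text
over = count_words(text, 3, overlapping=True)
non_over = count_words(text, 3, overlapping=False)
print(sum(over.values()), sum(non_over.values()))
```

A text of length L yields L - n + 1 overlapping words but only about L/n
non-overlapping ones, which is why overlapping counting provides many more,
though statistically dependent, sample elements.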



The form of the guessed theoretical distribution p(k, α, β, γ, ...) is
arbitrary as long as it is a normalized monotonically decreasing function
(Zipf-order). Suppose that one has no information about the mechanism which
generated a given sample. Then one has to find the functional form of the
guessed distribution which is most appropriate to the given problem, i.e. in
the ideal case the guessed distribution contains the real probability
distribution as a special case without being too complicated. An ansatz
p(k, α, β, γ, ...) is suitable if the optimized guessed probability
distribution reproduces the frequency distribution of the original sample
with satisfactory accuracy.



The ansatz (8) looks rather artificial: in fact we tried several forms of the
guessed probability distribution, and the one proposed in eq. (8) turned out
to be the best of them. None of the others reproduces the frequencies with
sufficient accuracy. For demonstration we assume the function

$$ p(k,\alpha,\beta) = \begin{cases} \alpha \exp(-\beta k) & : k < \gamma \\ 0 & : k \ge \gamma \end{cases} \qquad (9) $$

with the normalization γ = -β^(-1) log(1 - β/α). The optimized function is
drawn in figure 4 (guessed (b)). We find that the frequency distribution
reproduced from this function differs much more from the original frequency
distribution (Moby Dick) than that of the guess according to eq. (8).
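
The normalization quoted for eq. (9) follows from requiring unit total
probability. A short derivation, under the assumption (made here for
simplicity) that the rank k is treated as a continuous variable starting at 0:

```latex
\int_0^{\gamma} \alpha\, e^{-\beta k}\, dk
  = \frac{\alpha}{\beta}\left(1 - e^{-\beta\gamma}\right) = 1
\;\Longrightarrow\;
e^{-\beta\gamma} = 1 - \frac{\beta}{\alpha}
\;\Longrightarrow\;
\gamma = -\frac{1}{\beta}\,\log\!\left(1 - \frac{\beta}{\alpha}\right),
\qquad \alpha > \beta .
```

The cutoff γ is therefore not a free parameter of this ansatz; it is fixed
once α and β are chosen, and it exists only for α > β.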



Admittedly, any similar ansatz showing a well pronounced peak for low ranks
(frequent words), a long plateau with slow decrease (standard words) and a
long tail (rare words) could give reliable results as well. In any case, the
choice of parametric forms for the probability distribution is not wide.
Eq. (8) belongs to the class of distributions fulfilling this three-region
criterion. For a more detailed discussion of the statistics of words see e.g.
[11] and many references therein.






4 Discussion 



The problem addressed in this paper was to find the rank-ordered probability
distribution from the given frequency distribution of a finite sample. For
finite samples (Bernoulli sequence and English text) we have shown that the
calculation of the entropy using the relative frequencies instead of the
(unknown) probabilities yields wrong results.

We have shown that the proposed algorithm is able to find the correct
parameters of a guessed probability distribution which reproduces the
statistical characteristics of a given symbolic sequence. The method has been
tested for samples generated by well defined sources, i.e. by known
probability distributions, and for an unknown source, i.e. the word
distribution of an English text. For the sample sequences we have evidence
that the algorithm yields reliable results. The deviations of the entropy
values from the correct values are rather small and in all cases far smaller
than those of the observed entropies. For the unknown source "Moby Dick" we
have no direct possibility to check the quality of the method; however, the
calculated entropy values agree for small word lengths n with the observed
entropy and for larger n with the results of an extrapolation method [6]. In
this sense both approaches support each other. The proposed algorithm can be
applied to the trajectories of dynamic systems. Formally, the trajectory of a
discrete dynamics is a text written in a certain language using A different
letters. The rank-ordered distribution of sub-trajectories of length n is
among the most important characteristics of a discrete dynamic system. In
this way we consider the analysis of English text as an example for the
analysis of a very complex dynamic system.

In many cases there is a fundamental limitation of the available data, i.e.
the available samples are small with respect to the needs of reliable
statistics, and hence there is a fundamental limitation for the calculation
of the statistical properties using frequencies instead of probabilities. For
such cases the proposed method allows one to calculate values which could not
be found so far. Given a finite set of data, the proposed method yields the
most reliable values for the Zipf-ordered distributions and the entropies
which are presently available. The method is not restricted to the
calculation of the entropy: all statistical properties which depend on the
Zipf-ordered probability distribution can be estimated using the proposed
algorithm.